Abstract

We develop novel first- and second-order features for dependency parsing based on the Google Syntactic Ngrams corpus, a collection of subtree counts of parsed sentences from scanned books. We also extend previous work on surface n-gram features from Web1T to the Google Books corpus and from first-order to second-order, comparing and analysing performance over newswire and web treebanks.

Surface and syntactic n-grams both produce substantial and complementary gains in parsing accuracy across domains. Our best system combines the two feature sets, achieving up to 0.8% absolute UAS improvements on newswire and 1.4% on web text.


1 Introduction 


Current state-of-the-art parsers score over 90% on the standard newswire evaluation, but the remaining errors are difficult to overcome using only the training corpus. Features from n-gram counts over resources like Web1T (Brants and Franz, 2006) have proven to be useful proxies for syntax (Bansal and Klein, 2011; Pitler, 2012), but they enforce linear word order, and are unable to distinguish between syntactic and non-syntactic co-occurrences. Longer n-grams are also noisier and sparser, limiting the range of potential features.

In this paper we develop new features for the graph-based MSTParser (McDonald and Pereira, 2006) from the Google Syntactic Ngrams corpus (Goldberg and Orwant, 2013), a collection of Stanford dependency subtree counts. These features capture information collated across millions of subtrees produced by a shift-reduce parser, trading off potential systemic parser errors for data that is better aligned with the parsing task. We compare the performance of our syntactic n-gram features against the surface n-gram features of Bansal and Klein (2011) in-domain on newswire and out-of-domain on the English Web Treebank (Petrov and McDonald, 2012) across CoNLL-style (LTH) dependencies. We also extend the first-order surface n-gram features to second-order, and compare the utility of Web1T and the Google Books Ngram corpus (Lin et al., 2012) as surface n-gram sources.

We find that surface and syntactic n-grams provide statistically significant and complementary accuracy improvements in- and out-of-domain. Our best LTH system combines the two feature sets to achieve 92.5% unlabeled attachment score (UAS) on newswire and 85.2% UAS averaged over web text, against baselines of 91.7% and 83.8% respectively. Our analysis shows that the combined system is able to draw upon the strengths of both surface and syntactic features whilst avoiding their weaknesses.

2 Syntactic n-gram Features 

The Google Syntactic Ngrams English (2013) corpus contains counts of dependency tree fragments over a 345 billion word selection of the Google Books data, parsed with a beam-search shift-reduce parser and Stanford dependencies (Goldberg and Orwant, 2013). The parser is trained over substantially more annotated data than typically used in dependency parsing.

Unlike surface n-grams, syntactic n-grams are not restricted to linear word order or affected by non-syntactic co-occurrence. Given a head-argument ambiguity, we extract different combinations of word, POS tag, and directionality, and search the Syntactic Ngrams corpus for matching subtrees. To reduce the impact of this search during run time, we extract all possible combinations in the training and test corpora ahead of time and total the frequencies of each configuration, storing these in a lookup table that is used by the parser at run-time to compute feature values. We did not use any features based on the dependency label as these are assigned in a separate pass in MSTParser.

Feature Lookup                  Count          Bucket
hold (head)                     80,129k        4
hearing (arg)                   7,839k         4
hold → hearing                  15k            3
hold → hearing (head left)      15k            3
VB (head)                       20,996,911k    5
NN (arg)                        22,163,825k    5
VB (child right)                6,261,484k     5
NN (head left)                  15,478,472k    5
VB → NN                         1,784,891k     5
VB → NN (head left)             1,437,932k     5
hold → NN                       7,362k         4
hold → NN (head left)           6,248k         4
VB → hearing                    396k           3
VB → hearing (head left)        354k           3

Table 1: Syntactic n-gram features, their counts in the extended arcs dataset, and the bucketed count for the hold and hearing dependency.
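The ahead-of-time lookup table described in this section might be built along the following lines. This is a minimal sketch, not the authors' implementation: the `Arc` record, the `arc_keys` and `build_lookup` helpers, and the key layout are all hypothetical, the corpus is assumed to be available as an in-memory list of counted arcs, and only the head/argument pair combinations are shown (the single-item and other lookups in Table 1 are omitted).

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Arc(NamedTuple):
    """One head-argument arc; `count` is only meaningful for corpus arcs."""
    head_word: str
    head_pos: str
    arg_word: str
    arg_pos: str
    head_left: bool  # True if the head precedes the argument
    count: int = 0


def arc_keys(arc: Arc) -> List[Tuple]:
    """Word/POS/direction combinations for one arc (illustrative subset).

    Items are tagged "w" (word) or "p" (POS tag) so that a word can never
    collide with a tag of the same spelling.
    """
    direction = "head_left" if arc.head_left else "head_right"
    keys = []
    for head in (("w", arc.head_word), ("p", arc.head_pos)):
        for arg in (("w", arc.arg_word), ("p", arc.arg_pos)):
            keys.append((head, arg))             # ignores direction
            keys.append((head, arg, direction))  # direction-specific
    return keys


def build_lookup(corpus_arcs: Iterable[Arc],
                 candidate_arcs: Iterable[Arc]) -> Counter:
    """Total corpus frequencies for every key the parser could ask about."""
    needed = {key for arc in candidate_arcs for key in arc_keys(arc)}
    table = Counter()
    for arc in corpus_arcs:
        for key in arc_keys(arc):
            if key in needed:
                table[key] += arc.count
    return table


# Toy data: the counts are made up for illustration.
corpus = [
    Arc("hold", "VB", "hearing", "NN", True, 10),
    Arc("hold", "VB", "meeting", "NN", True, 5),
    Arc("held", "VBD", "hearing", "NN", False, 2),
]
candidates = [Arc("hold", "VB", "hearing", "NN", True)]
table = build_lookup(corpus, candidates)
```

Restricting the table to keys that occur in the training and test corpora keeps it small enough to consult at parse time, which is the motivation given above for precomputing it.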

Table 1 summarizes the first-order features extracted from the dependency hold → hearing depicted in Figure 1. The final feature encodes the POS tags of the head and argument, directionality, the binned distance between the head and argument, and a bucketed frequency of the syntactic n-gram calculated as per Equation 1, creating bucket labels from 0 in increments of 5 (0, 5, 10, etc.).


\[
\text{bucket} = \left\lfloor \frac{\log_2\left(\sum \text{frequency}\right)}{5} \right\rfloor \times 5 \tag{1}
\]
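As a rough illustration of the bucketing in Equation 1, the sketch below rounds the base-2 logarithm of a summed frequency down to a multiple of 5. The function name `bucket` and the handling of frequencies below 1 are assumptions made for this example, and the rounding direction (floor) is also an assumption, since the bracket notation is not recoverable from the text.

```python
import math


def bucket(total_frequency: int, width: int = 5) -> int:
    """Discretize a summed frequency into labels 0, 5, 10, ... (Equation 1).

    Assumes floor rounding; frequencies below 1 are mapped to bucket 0
    because their logarithm is undefined or negative.
    """
    if total_frequency < 1:
        return 0
    return math.floor(math.log2(total_frequency) / width) * width


# Example: a frequency of 15,000 has log2 of about 13.9, so it falls in bucket 10.
example = bucket(15_000)
```

Bucketing in log space means that each label covers a 32-fold range of raw counts, so small differences between very large counts do not produce distinct features.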


Additional features for each bucket value up to the maximum are also encoded. We also develop paraphrase-style features like those of Bansal and Klein (2011), based on the most frequently occurring words and POS tags before, in between, and after each head-argument ambiguity (see Section 3.2).


could hold a public hearing next week

Figure 1: The paraphrase-style context words around hold → hearing in a syntactic n-gram. Context words are italicized and their arcs dashed.


Figure 1 depicts the potential context words available to the hold → hearing dependency.

We experiment with a number of second-order features, mirroring those extracted for surface n-grams in Section 3.3. We extract all triple and sibling word and POS structures considered by the parser in the training and test corpora (following the factorization depicted in Figure 2), and count their frequency in the Syntactic Ngrams corpus. Importantly, we require that matching subtrees in the Syntactic Ngrams corpus maintain the position of the parent relative to its children. We generate separate features encoding the word and POS tag variants of each triple and sibling structure.

Similar to the surface n-gram features (Section 3), counts for our syntactic n-gram features are pre-computed to improve the run-time efficiency of the parser. Experiments on the development set led to a minimum cutoff frequency of 10,000 for each feature to avoid noise from parser and OCR errors.


3 Surface n-gram Features 


Bansal and Klein (2011) demonstrate that features generated from bucketing simple surface n-gram counts and collecting the top paraphrase-based contextual words over Web1T are useful for almost all attachment decisions, boosting dependency parsing accuracy by up to 0.6%. However, this technique is restricted to counts based purely on the linear order of the adjacent words, and is unable to incorporate disambiguating information such as POS tags to avoid spurious counts. Bansal and Klein (2011) also tested only on in-domain text, though these external count features should be useful out of domain.

We extract Bansal and Klein (2011)'s affinity and paraphrase-style first-order features from the Google Books English Ngrams corpus, and compare their performance against Web1T counts. Both corpora are very large, contain different types of noise, and are sourced from very different underlying texts. We also extend Bansal and Klein's affinity and paraphrase features to second-order.

3.1 Surface n-gram Corpora 

The Web1T corpus contains counts of 1 to 5-grams over 1 trillion words of web text (Brants and Franz, 2006). Unigrams must appear at least 200 times in the source text before being included in the corpus, while longer n-grams have a cutoff of 40. Pitler et al. (2010) have documented a number of sources of noise in the corpus, including duplicate sentences (such as legal disclaimers and boilerplate text), disproportionately short or long sentences, and primarily alphanumeric sentences.

The Google Books Ngrams English corpus (2012) contains counts of 1 to 5-grams over 468 billion words sourced from scanned books published across three centuries (Michel et al., 2011). A uniform cutoff of 40 applies to all n-grams in this corpus. This corpus is affected by the accuracy of OCR and digitization tools; the changing typography of books across time is one issue that may create spurious co-occurrences and counts (Lin et al., 2012).


3.2 First-order surface n-gram features 

Affinity features rely on the intuition that frequently co-occurring words in large unlabeled text collections are likely to be in a syntactic relationship (Nakov and Hearst, 2005; Bansal and Klein, 2011). N-gram resources such as Web1T and Google Books provide large offline collections from which these co-occurrence statistics can be harvested; given each head and argument ambiguity in a training and test corpus, the corpora can be linearly scanned ahead of parsing time to reduce the impact of querying in the parser. When scanning, the head and argument word may appear immediately adjacent to one another in linear order (CONTIG), or with up to three intervening words (GAP1, GAP2, and GAP3), as the maximum n-gram length is five. The total count is then discretized as per Equation 1.
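The scan for CONTIG and GAP configurations can be sketched as below. This is an illustrative reading of the description above rather than the original code: the in-memory `ngram_counts` mapping, the `affinity_counts` helper, and the toy counts are all hypothetical, and a real scan would stream the corpus files instead.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Number of intervening words -> configuration name.
GAP_NAMES = {0: "CONTIG", 1: "GAP1", 2: "GAP2", 3: "GAP3"}


def affinity_counts(
    ngram_counts: Dict[Tuple[str, ...], int],
    queries: Iterable[Tuple[str, str]],
) -> Dict[Tuple[str, str], Counter]:
    """One linear pass over the n-grams, totalling counts per query pair.

    An n-gram matches a query (q1, q2) when q1 is its first token and q2
    its last, so its length (2 to 5) fixes the number of intervening words.
    """
    totals = {query: Counter() for query in queries}
    for tokens, count in ngram_counts.items():
        if not 2 <= len(tokens) <= 5:
            continue
        pair = (tokens[0], tokens[-1])
        if pair in totals:
            totals[pair][GAP_NAMES[len(tokens) - 2]] += count
    return totals


# Toy counts, made up for illustration.
ngram_counts = {
    ("hold", "hearing"): 100,
    ("hold", "a", "hearing"): 300,
    ("hold", "the", "hearing"): 20,
    ("hold", "a", "public", "hearing"): 50,
    ("a", "hearing"): 999,
}
result = affinity_counts(ngram_counts, [("hold", "hearing")])
per_gap = result[("hold", "hearing")]
total = sum(per_gap.values())
```

Because every query is collected before the pass begins, the corpus is read once regardless of how many head-argument pairs are of interest.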

The final parser features include the POS tags of the potential head and argument, the discretized count, directionality, and the binned length of the dependency. Additional cumulative features are generated using each bucket from the pre-calculated bucket up to the maximum bucket size.
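One possible encoding of these features is sketched below. The feature-string layout, the `AFF` and `AFF_CUM` prefixes, and the `affinity_features` helper are invented for illustration, and the cumulative variants follow a literal reading of the sentence above (one feature per bucket from the computed value up to the maximum).

```python
from typing import List


def affinity_features(head_pos: str, arg_pos: str, direction: str,
                      dist_bin: int, bucket: int, max_bucket: int,
                      width: int = 5) -> List[str]:
    """Build the base affinity feature plus its cumulative variants."""
    context = f"{head_pos}|{arg_pos}|{direction}|{dist_bin}"
    features = [f"AFF|{context}|b={bucket}"]
    # Cumulative variants: every bucket label from the computed bucket
    # up to and including the maximum bucket.
    for b in range(bucket, max_bucket + 1, width):
        features.append(f"AFF_CUM|{context}|b={b}")
    return features


feats = affinity_features("VB", "NN", "R", 2, bucket=10, max_bucket=20)
```

Conjoining the count bucket with POS tags, direction, and distance lets the parser learn a separate weight for the same count in different syntactic contexts.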

Paraphrase-style surface n-gram features attempt to infer attachments indirectly. Nakov and Hearst (2005) propose several static patterns to resolve a variety of nominal and prepositional attachment ambiguities. For example, they take sentence (1) below, paraphrase it into sentence (2), and examine how frequent the paraphrase is. If the paraphrase occurs sufficiently often, this serves as evidence for the nominal attachment to demands in sentence (1) rather than the verbal attachment to meet.

1. meet demands from customers

2. meet the customers demands


In Bansal and Klein (2011), paraphrase features are generated for all full-parse attachment ambiguities from the surface n-gram corpus. For each attachment ambiguity, 3-grams of the form (⋆ q1 q2), (q1 ⋆ q2), and (q1 q2 ⋆) are extracted, where q1 and q2 are the head and argument in their linear order of appearance in the original sentence, and ⋆ is any single context word appearing before, in between, or after the query words. Then the most frequent words appearing in each of these configurations for each head-argument ambiguity are encoded as features with the POS tags of the head and argument.² Given the arc hold → hearing in Figure 2, public is the most frequent word appearing in the n-gram (hold ⋆ hearing) in Web1T. Thus, the final encoded feature is POS(hold) ∧ POS(hearing) ∧ public ∧ mid. Further generalization is achieved by using a unigram POS tagger trained on the WSJ data to tag each context word, and encoding features using each unique tag of the most frequent context words.
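The selection of context words for the three 3-gram configurations can be illustrated as follows. The `top_context_words` helper, the in-memory `trigram_counts` mapping, and the toy counts are hypothetical; the default limits of 20 words in between and 5 words before and after follow the footnote on paraphrase-style features.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_context_words(
    trigram_counts: Dict[Tuple[str, str, str], int],
    q1: str,
    q2: str,
    k_mid: int = 20,
    k_side: int = 5,
) -> Dict[str, List[str]]:
    """Most frequent context words before, between, and after (q1, q2)."""
    before, mid, after = Counter(), Counter(), Counter()
    for (a, b, c), count in trigram_counts.items():
        if (b, c) == (q1, q2):   # (* q1 q2)
            before[a] += count
        if (a, c) == (q1, q2):   # (q1 * q2)
            mid[b] += count
        if (a, b) == (q1, q2):   # (q1 q2 *)
            after[c] += count
    return {
        "before": [w for w, _ in before.most_common(k_side)],
        "mid": [w for w, _ in mid.most_common(k_mid)],
        "after": [w for w, _ in after.most_common(k_side)],
    }


# Toy counts, made up for illustration.
trigram_counts = {
    ("hold", "public", "hearing"): 50,
    ("hold", "a", "hearing"): 30,
    ("to", "hold", "hearing"): 5,
    ("hold", "hearing", "on"): 7,
}
context = top_context_words(trigram_counts, "hold", "hearing")
```

Each returned word would then be conjoined with the POS tags of the head and argument and its position (before, mid, or after) to form a feature such as the one shown above.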


3.3 Second-order surface n-gram features 

We extend the first-order surface n-gram features to new features over second-order structures. We experimented with triple and sibling features, reflecting the second-order factorization used in MSTParser (see Figure 2).

² The top 20 words in between and top 5 words before and after are used for all paraphrase-style features in this paper.

hold a hearing next Tuesday

Figure 2: The second-order factorization used in MSTParser, with a parent and two adjacent children.

LTH        BASE   WEB    BKS    SYN    BKS+SYN   %Δ
WSJ 22     92.3   92.9   92.9   92.7   93.2      +0.9
WSJ 23     91.7   92.2   92.3   92.4   92.6      +0.9
EWT ANS    82.5   83.4   83.2   83.6   83.6      +1.1
EWT NGS    85.2   86.1   86.1   86.1   86.4      +0.9
EWT REV    83.6   84.5   84.3   84.9   85.0      +1.3
EWT AVG    83.8   84.6   84.5   84.8   85.0      +1.2

As with first-order features, we convert all triple and sibling structures from the training and test data into query n-grams, maintaining their linear order. In Figure 2 the corresponding n-grams are hold hearing Tuesday, and hearing Tuesday. We then scan the n-gram corpus for each query n-gram and sum the frequency of each. Frequencies are summed over each configuration (including intervening words) that the query n-gram may appear in, as depicted below.

• (q1 q2 q3)        • (q1 ⋆ q2 ⋆ q3)

• (q1 ⋆ q2 q3)      • (q1 ⋆ ⋆ q2 q3)

• (q1 q2 ⋆ q3)      • (q1 q2 ⋆ ⋆ q3)

where q1, q2, and q3 are the words of the triple in their linear order, and ⋆ is a single intervening word of any kind.
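The six configurations listed above can be matched against an n-gram table as in the sketch below. The `triple_count` helper and the in-memory `ngram_counts` mapping are hypothetical, and counting each n-gram at most once, even when it fits two configurations, is an assumption of this example.

```python
from typing import Dict, Tuple

# (words between q1 and q2, words between q2 and q3) for the six
# configurations; at most two intervening words fit in a 5-gram.
TRIPLE_GAPS = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (0, 2)]


def triple_count(ngram_counts: Dict[Tuple[str, ...], int],
                 q1: str, q2: str, q3: str) -> int:
    """Sum the counts of all n-grams matching any triple configuration."""
    total = 0
    for tokens, count in ngram_counts.items():
        n = len(tokens)
        if not 3 <= n <= 5 or tokens[0] != q1 or tokens[-1] != q3:
            continue
        # q2 sits at index 1 + g1 when g1 words separate it from q1.
        if any(g1 + g2 + 3 == n and tokens[1 + g1] == q2
               for g1, g2 in TRIPLE_GAPS):
            total += count
    return total


# Toy counts, made up for illustration.
ngram_counts = {
    ("hold", "hearing", "Tuesday"): 10,
    ("hold", "a", "hearing", "Tuesday"): 4,
    ("hold", "hearing", "next", "Tuesday"): 6,
    ("hold", "a", "hearing", "next", "Tuesday"): 3,
    ("hold", "a", "public", "hearing", "Tuesday"): 2,
    ("hold", "hearing", "on", "next", "Tuesday"): 1,
    ("hold", "a", "meeting", "Tuesday"): 100,
}
triple_total = triple_count(ngram_counts, "hold", "hearing", "Tuesday")
```

In the toy data the last entry is skipped because its middle word is not the query word, so only the first six n-grams contribute to the total.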

We encode the POS tags of the parent and children (or just the children for sibling features), along with the bucketed count, directionality, and the binned distance between the two children. We also extract paraphrase-style features for siblings in the same way as for first-order n-grams, and cumulative variants up to the maximum bucket size.


4 Experimental Setup 


As with Bansal and Klein (2011) and Pitler (2012), we convert the Penn Treebank to dependencies using pennconverter (Johansson and Nugues, 2007) (henceforth LTH) and generate POS tags with MXPOST (Ratnaparkhi, 1996). We used sections 02-21 of the WSJ for training, 22 for development, and 23 for final testing. The test splits of the answers, newsgroups, and reviews sections of the English Web Treebank as per the SANCL 2012 Shared Task (Petrov and McDonald, 2012) were converted to LTH and used for out-of-domain evaluation. We used MSTParser (McDonald and Pereira, 2006), trained with the parameters order:2, training-k:5, iters:10, and loss-type:nopunc. We omit labeled attachment scores in this paper for brevity, but they are consistent with the reported UAS scores.

Table 2: LTH UAS (MSTParser) on the WSJ dev and test set, and English Web Treebank (EWT) answers (ANS), newsgroups (NGS), and reviews (REV) test set for the baseline (BASE), Web1T (WEB), Google Books (BKS), Syntactic (SYN), and combined (BKS + SYN) feature sets. All results are statistically significant improvements over the baseline.

5 Results 

Table 2 summarizes our results over the WSJ development and test datasets, and the SANCL 2012 test datasets. All of our features perform very similarly to one another: each feature set in isolation provides a roughly 0.5% UAS improvement over the baseline parser on the WSJ development and test sections. On the out-of-domain web treebank, surface and syntactic features each improve over the baseline by an average of roughly 0.8 - 1.0% on the test sets. All of our results are also statistically significant improvements over the baseline.

While our syntactic n-gram counts are computed over Stanford dependencies and almost certainly include substantial parser and OCR errors, they still provide a significant performance improvement for LTH parsing. Additionally, the Syntactic Ngrams dataset is drawn from a wide variety of genres, but helps with both newswire and web text parsing.

The best results on LTH dependencies used second-order sibling features in addition to the first-order features for both surface and syntactic n-grams. A combined system of Google Books surface n-gram features and syntactic n-gram features (which performed individually best on the development set) produces absolute UAS improvements of 0.8% over the baseline on the WSJ test set, and 1.4% over the baseline averaged across the three web treebank testing domains. These results are significantly higher than any feature set in isolation, showing that surface and syntactic n-gram features are complementary and individually address different types of errors being made by the parser.

Figure 3: Total LTH attachment errors by gold argument POS tag, sorted by the total tag frequency.

Tag      Freq    BASE    COMB    %
NN       5725    5433    5470    12.0
NNP      4043    3810    3843    10.7
IN       4026    3457    3513    18.2
DT       3511    3425    3431    2.0
NNS      2504    2344    2379    11.4
JJ       2472    2314    2335    6.8
CD       1845    1736    1739    1.0
VBD      1705    1579    1606    8.8
RB       1308    1091    1106    4.9
CC       1000    848     850     0.7
VB       983     941     947     2.0
TO       868     766     784     5.8
VBN      850     783     792     2.9
VBZ      705     636     638     0.7
PRP      612     604     606     0.7
VBG      588     500     511     3.6
POS      428     422     422     0.0
$        352     345     343     -0.7
MD       344     307     313     2.0
VBP      341     298     305     2.3
PRP$     288     281     280     -0.3
Other    1010    868     883     4.9

Table 3: Correct attachments by gold argument POS tag and the percentage of the overall error reduction over WSJ section 22 for the baseline and combined systems in Table 2.
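The percentage column of Table 3 appears to be each tag's share of the net gain in correct attachments, and the sketch below recomputes it on that assumption. The `error_reduction_shares` helper is hypothetical; the BASE and COMB figures are copied from Table 3.

```python
from typing import Dict, Tuple

# tag -> (correct attachments for BASE, correct attachments for COMB)
TABLE3: Dict[str, Tuple[int, int]] = {
    "NN": (5433, 5470), "NNP": (3810, 3843), "IN": (3457, 3513),
    "DT": (3425, 3431), "NNS": (2344, 2379), "JJ": (2314, 2335),
    "CD": (1736, 1739), "VBD": (1579, 1606), "RB": (1091, 1106),
    "CC": (848, 850), "VB": (941, 947), "TO": (766, 784),
    "VBN": (783, 792), "VBZ": (636, 638), "PRP": (604, 606),
    "VBG": (500, 511), "POS": (422, 422), "$": (345, 343),
    "MD": (307, 313), "VBP": (298, 305), "PRP$": (281, 280),
    "Other": (868, 883),
}


def error_reduction_shares(rows: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Each tag's gain in correct attachments as a percentage of the net gain."""
    gains = {tag: comb - base for tag, (base, comb) in rows.items()}
    net_gain = sum(gains.values())
    return {tag: 100.0 * gain / net_gain for tag, gain in gains.items()}


net_gain = sum(comb - base for base, comb in TABLE3.values())
shares = error_reduction_shares(TABLE3)
# e.g. NN gains 37 of the 308 net additional correct attachments.
```

Under this reading the shares sum to 100%, and tags on which the combined system loses attachments ($ and PRP$) receive negative values.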

6 Analysis 

Figure 3 gives an error breakdown by high-frequency gold argument POS tag on LTH dependencies for the baseline, Web1T surface n-grams, syntactic n-grams, and combined systems reported in Table 2. For almost every POS tag, the combined system outperforms the baseline and makes equal or fewer errors than either the surface or syntactic n-gram features in isolation. Syntactic n-grams are worse relative to surface n-grams on noun, adjectival, and prepositional parts of speech - constructions which are known to be difficult to parse. Without NP-bracketed training data or the extra features that we have discussed as helping resolve these issues, it is unsurprising that syntactic n-gram features using the counts from the Goldberg and Orwant (2013) parser are less effective. In comparison, surface n-grams are worse on conjunctive and verbal parts of speech, suggesting that the localized nature of these features is less useful for the idiosyncrasies of coordination representations and longer-range subject/object relationships.

Whilst Web1T and Google Books features perform similarly overall, Books n-grams are more effective for noun structures, and Web1T n-grams are slightly better in predicting PP attachment sites.

Table 3 lists a complete breakdown of the attachments corrected by the combined system. The most substantial gains come in nominal and prepositional phrases - known weaknesses for parsers, and the categories where syntactic n-gram features alone fare worst. However, the system finds less improvement in coordinators, determiners, and cardinal numbers, all of which are also components of noun phrases. This shows the difficulty of correctly identifying a head noun in a nominal to attach modifiers to, and the general difficulty of representing and parsing coordination.

Corpus          Not Present    %
Google Books    1,714,631      32.5
Web1T           1,425,347      27.0
Intersection    1,301,090      24.7

Table 4: Surface n-gram queries from the WSJ and English Web Treebank that do not receive features from Web1T and Google Books.

Web1T contains approximately double the total number of n-grams in Google Books. Table 4 shows that 27% and 32.5% of the n-gram queries from the WSJ sections 2-23 and the entire English Web Treebank do not receive features from Web1T and Google Books respectively. The intersection of these queries is 24.7% of the total, showing that the two corpora have small but substantive differences in word distributions; this may partially explain why our combined feature experiments work so well. However, the similar performance of surface n-gram features extracted from these sources suggests Web1T contains substantial noise.


We had expected our syntactic n-gram features to perform better than they did since they address many of the shortcomings of using surface n-grams. Syntactic features are sensitive to the quality of the parser used to produce them, but in this case the parser is difficult to assess as the source corpus is enormous and extracted using OCR from scanned books. Even if the parser is state of the art, it is being used to parse diverse texts spanning multiple genres across a wide time period, compounded by potential scanning and digitization errors. Additionally, a post-hoc analysis of the types of errors present in the corpus is impossible due to the exclusion of the full parse trees, though Goldberg and Orwant (2013) note that this data would almost certainly be computationally prohibitive to process. Despite this, our work has shown that counts from this corpus provide useful features for parsing. Furthermore, these features stack with surface n-gram features, providing substantial overall performance improvements.

6.1 Future Work 

A combination of features from all of the sources used in this work would be an interesting avenue for further investigation, especially since these features seem strongly complementary. We could also explore more of the POS and head-modifier annotations available in the Google Books Ngram corpus to develop features which are a middle ground between surface and syntactic n-gram features.

The Google Books and Syntactic Ngrams corpora both provide frequencies by date, and it would be interesting to explore how well features extracted from different date ranges would perform - particularly on text from roughly the same eras. Resampling Web1T to reduce it to a comparable corpus that is the same size as Google Books would also provide better insights on how many n-grams are noise.


7 Related Work 


Surface n-gram counts from large web corpora have been used to address NP and PP attachment errors (Volk, 2001; Nakov and Hearst, 2005). Aside from Bansal and Klein (2011), other feature-based approaches to improving dependency parsing include Pitler (2012), who exploits Brown clusters and point-wise mutual information of surface n-gram counts to specifically address PP and coordination errors. Chen et al. (2013) describe a novel way of generating meta-features that work to emphasise important feature types used by the parser.

Chen et al. (2009) generate subtree-based features that are similar to ours. However, they use the in-domain BLLIP newswire corpus to generate their subtree counts, whereas the Syntactic Ngrams corpus is out-of-domain and an order of magnitude larger. They also use the same underlying parser to generate the BLLIP subtree counts and as the final test-time parser, while Syntactic Ngrams is parsed with a simpler, shift-reduce parser compared to the graph-based MSTParser used during test time. They also evaluate only on newswire text, whilst our work systematically explores various configurations of surface and syntactic n-gram features in- and out-of-domain.









































8 Conclusion 

We developed features for dependency parsing using subtree counts from 345 billion parsed words of scanned English books. We extended existing work on surface n-grams from first to second-order, and investigated the utility of web text and scanned books as sources of surface n-grams.

Our individual feature sets all perform similarly, providing significant improvements in parsing accuracy of about 0.5% on newswire and up to 1.0% averaged across the web treebank domains. They are also complementary, with our best system combining surface and syntactic n-gram features to achieve up to 0.8% UAS improvements on newswire and 1.4% on web text. We hope that our work will encourage further efforts to unify different sources of unlabeled and automatically parsed data for dependency parsing, addressing the relative strengths and weaknesses of each source.